Clustering

Clustering on the dataset: Tesla Deaths - Deaths

Brief introduction to clustering

Clustering is a method used in machine learning and data analysis to group similar data points together based on certain features or characteristics. The goal of clustering is to partition a dataset into subsets, or clusters, where data points within the same cluster are more similar to each other than to those in other clusters. This helps in identifying patterns, structures, or hidden relationships within the data.
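As a minimal illustration (a sketch on synthetic data, not part of the Tesla analysis below), K-means recovers group structure from unlabeled points:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three synthetic groups of 2-D points; the model never sees the true labels.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])        # cluster assignment for the first ten points
print(km.cluster_centers_)    # one centroid per cluster
```

Each point is assigned to its nearest centroid, so points within a cluster are more similar to each other than to points in other clusters.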

Summary of the dimension reduction effort:

In this analysis, the feature matrix X comprises 'CP', 'tsla+cp', 'VTAD', 'Claimed', 'Tesla_occupant', and 'Other_vehicle', while the target variable is the state in which the accident took place. Notably, visualization via dimension-reduction techniques like PCA and t-SNE proves ineffective here because the projections show no distinct separation between groups. Consequently, clustering methods such as K-means are employed in this section as a more suitable way to explore the data's inherent patterns.

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.cluster import KMeans

Import the 'KMeans' class and use a for-loop to inspect the inertia (within-cluster sum of squared distances) for a range of K values

df = pd.read_csv('./cleandata/cleanTelsa.csv')
features = df[['CP','tsla+cp', 'VTAD', 'Claimed', 'Tesla_occupant','Other_vehicle']]
target = df['State']
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)
ks = range(1, 10)
inertias = []
for k in ks:
    # Set n_init explicitly; its default changes to 'auto' in scikit-learn 1.4.
    km = KMeans(n_clusters=k, n_init=10, random_state=8)
    km.fit(features_scaled)
    inertias.append(km.inertia_)

plt.plot(ks, inertias, marker='o')
plt.xlabel('K (number of clusters)')
plt.ylabel('Inertia')

The elbow plot above does not clearly favor any specific K value: inertia decreases smoothly with no sharp bend. Therefore, a cluster count of 3 or 4 seems a reasonable choice given the observed pattern.
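When the elbow plot is inconclusive, the silhouette score offers a complementary check on candidate K values. A sketch, using synthetic blobs as a stand-in for the standardized feature matrix:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled features; any (n_samples, n_features) array works.
X, _ = make_blobs(n_samples=300, centers=4, n_features=6, random_state=8)

# The silhouette score requires at least 2 clusters, so start the range at k=2.
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=8).fit(X)
    score = silhouette_score(X, km.labels_)
    print(f"k={k}: silhouette={score:.3f}")
```

Scores near 1 indicate tight, well-separated clusters; values near 0 suggest overlapping clusters, consistent with an ambiguous elbow.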

pca = PCA(n_components=2)
p_comps = pca.fit_transform(features_scaled)
km = KMeans(n_clusters=3, n_init=10, random_state=10)
km.fit(features_scaled)
plt.scatter(p_comps[:,0], p_comps[:,1], c=km.labels_)

Use 4 as the number of clusters and refit the model

It appears that there isn't a substantial difference between using k=3 and k=4. For the sake of illustration, we will proceed with k=4.

km = KMeans(n_clusters=4, n_init=10, random_state=10)
km.fit(features_scaled)
plt.scatter(p_comps[:,0],p_comps[:,1],c=km.labels_)
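The claim that k=3 and k=4 give similar partitions can be quantified with the adjusted Rand index (ARI) on the two label vectors. A sketch, again with synthetic data in place of the scaled features:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the scaled feature matrix.
X, _ = make_blobs(n_samples=300, centers=3, n_features=6, random_state=10)

labels3 = KMeans(n_clusters=3, n_init=10, random_state=10).fit_predict(X)
labels4 = KMeans(n_clusters=4, n_init=10, random_state=10).fit_predict(X)

# ARI is 1.0 for identical partitions and near 0 for unrelated ones.
ari = adjusted_rand_score(labels3, labels4)
print(f"ARI(k=3, k=4) = {ari:.3f}")
```

A high ARI would confirm that moving from 3 to 4 clusters mostly splits one existing group rather than reshuffling the assignments.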

In summary:

A logical next step in our analysis is to examine how the four clusters differ across the six features employed for clustering. Instead of relying on scaled features, we revert to the unscaled features to facilitate a more interpretable exploration of the differences.
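A compact way to compare clusters on the unscaled features is a per-cluster mean table. A sketch, using a hypothetical miniature of the cleaned frame (column names match the report; the values are made up):

```python
import pandas as pd

# Hypothetical miniature of the cleaned Tesla frame with cluster labels attached.
df = pd.DataFrame({
    'CP': [0, 1, 0, 1], 'tsla+cp': [0, 1, 1, 1],
    'VTAD': [0, 0, 1, 0], 'Claimed': [1, 1, 0, 0],
    'Tesla_occupant': [1, 0, 2, 1], 'Other_vehicle': [0, 1, 0, 2],
    'cluster': [0, 1, 0, 1],
})

# Per-cluster means on the unscaled features keep the units interpretable.
summary = df.groupby('cluster')[['CP', 'tsla+cp', 'VTAD', 'Claimed',
                                 'Tesla_occupant', 'Other_vehicle']].mean()
print(summary)
```

Each row summarizes one cluster, so large differences between rows flag the features that drive the grouping.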

Visualize the results using a violin plot

import seaborn as sns
df['cluster'] = km.labels_
melt_car = pd.melt(df,id_vars='cluster',var_name="predictor",value_name="percent",
                 value_vars=['CP','tsla+cp', 'VTAD', 'Claimed', 'Tesla_occupant','Other_vehicle'] )
sns.violinplot(data=melt_car,y='predictor',x='percent',hue='cluster')

In summary:

From the presented plot, it is now evident that distinct clusters of states may necessitate varied interventions, such as implementing safety measures like road cushioning or enforcing more stringent traffic laws tailored to each group’s specific characteristics.

Clustering on the dataset: Traffic Accidents and Vehicles (gas car)

Code
data = pd.read_csv('./Data/RoadAccident.csv')
X = data.drop(columns='Accident_Severity')
y = data['Accident_Severity']
numerical_features = list()
categorical_features = list()
for column in X.columns:
    # In the dataset we only have float and int64.
    if (data[column].dtype == 'float64' or data[column].dtype == 'int64'):
        numerical_features.append(column)
    # Categorical
    elif (data[column].dtype == 'object'):
        categorical_features.append(column)
data = X[numerical_features]
SS=StandardScaler()
X=pd.DataFrame(SS.fit_transform(data), columns=data.columns)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
principal_df = pd.DataFrame(data = X_pca, columns = ['PC1', 'PC2'])
kmeans = KMeans(n_clusters=5, n_init=15, max_iter=500, random_state=0)
clusters = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_
# Pass the centroids with matching column names so PCA's feature-name check does not warn.
centroids_pca = pca.transform(pd.DataFrame(centroids, columns=X.columns))

Plot the clustering results in both 2-D and 3-D

plt.figure(figsize=(8,6))
plt.scatter(principal_df.iloc[:,0], principal_df.iloc[:,1], c=clusters, cmap="brg", s=40)
plt.scatter(x=centroids_pca[:,0], y=centroids_pca[:,1], marker="x", s=100, linewidths=3, color="black")
plt.title('PCA plot in 2D')
plt.xlabel('PC1')
plt.ylabel('PC2')

pca = PCA(n_components=3)
components = pca.fit_transform(X)
import plotly.express as px
fig = px.scatter_3d(
    components, x=0, y=1, z=2, color=clusters, size=0.1*np.ones(len(X)), opacity = 1,
    title='PCA plot in 3D',
    labels={'0': 'PC 1', '1': 'PC 2', '2': 'PC 3'},
    width=650, height=500
)
fig.show()

In summary:

  1. Clusters are not distinctly visible in either the 2-D or the 3-D plot, primarily because the dataset is large and the scattered points overlap heavily. Subsetting or sampling the dataset would improve visualization clarity.
  2. Utilizing clustering methods enables the categorization of the entire dataset into smaller groups, each representing distinct types of accidents. Subsequently, any of these groups can be selected for in-depth analysis.
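The two points above can be sketched together: fit the model, select one cluster for in-depth analysis, and thin it by sampling for a legible plot. The feature names below are hypothetical stand-ins for the numerical accident columns:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for the numerical accident features.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(1000, 4)),
                    columns=['speed_limit', 'casualties', 'vehicles', 'hour'])

kmeans = KMeans(n_clusters=5, n_init=15, max_iter=500, random_state=0)
data['cluster'] = kmeans.fit_predict(data[['speed_limit', 'casualties',
                                           'vehicles', 'hour']])

# Select one group for in-depth analysis, then sample it for plotting.
group0 = data[data['cluster'] == 0]
sample = group0.sample(n=min(200, len(group0)), random_state=0)
print(len(group0), len(sample))
```

Plotting `sample` instead of the full dataset avoids the overlap that obscures the cluster structure in the figures above.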